PatientProfiler Package

PerfettoLab

2025-03-17

PatientProfiler

PatientProfiler is an R package that allows personalized analysis of an entire cohort and specific individuals. The main functions can be used sequentially or separately.

If we want to imagine a sequential way to use PatientProfiler, we can divide the pipeline into 4 main steps:

Install PatientProfiler

To install PatientProfiler execute the following command in R.

devtools::install_github('https://github.com/SaccoPerfettoLab/PatientProfiler/')

Set up Python PatientProfiler environment

dire come prerequisito di installare python

PatientProfiler contains some functions executing Python source code To correctly set up the Python PatientProfiler environment please execute the following code line the first time you use PatientProfiler:

install_pp_py()

Load necessary libraries

library(PatientProfiler)
library(SignalingProfiler)
library(tidyverse)

PREPARE YOUR DATA

To use this R package, you need to have the right format for your input data frames.

The input has to have specific columns, as described here:

STEP 1) RAW DATA HARMONIZATION

Description If the IDs of your samples start with a number, it is necessary to modify the IDs (e.g., with an ‘X’ before the number).

omics_update

This function allows to update UniProt IDs, add the sequence window, imputing NAs and calculating the Z-score.

To harmonize your raw omics data, you need to:

  1. Read your input dataframe(s);
  2. Format the input in the way described in the “Prepare your data” section;
  3. If you provide a transcriptomics dataframe as input, choose the percent threshold of zeros to filter rows with >= threshold% of zeros;
  4. If you provide a phosphoproteomics dataframe, choose the lenght of the sequence window. You can insert 7 or 15;
  5. Choose the imputation method for proteomics and phosphoproteomics;
  6. If you want to normalize with Z-score your data, insert TRUE for the zscore parameter, if not choose FALSE;
  7. In the case you select TRUE for the Z-score calculation, you can also set the method (by “column” or by “row”) and the metric (“median” or “mean”).

This step for big dataframes requires many hours of working mainly due the imputation. For this tutorial we are using a smaller amount of patients to give you a general vision of

###Example of usage:

transcriptomics_data <- read_tsv("./PatientProfiler_raw_input/Brca_transc.tsv") 
proteomics_data <- read_tsv("./PatientProfiler_raw_input/Brca_prot.tsv") 
phosphoproteomics_data <- read_tsv("./PatientProfiler_raw_input/Brca_phospho.tsv")  

omics_update(
             df_tr = NULL,   
             df_pr = NULL,   
             df_ph = phosphoproteomics_data,  
             threshold = 80,
             sw_len = 7,  
             pep_col_name = "Peptide",
             imp_method = "norm",   
             zscore = TRUE,   
             zmethod = "column",   
             metric = "median",
             output_dir = 'PatientProfiler_processed_input'
             )

The result of this step is each selected dataframe manipulated according to the chosen parameters. If, for example, the phosphoproteomics update was made, it will have the information regarding phosphosites updated with the sequence_window column and Uniprot_ID added, it will be imputed, and it will have a distribution based on whether or not Z-score should be calculated.

After this step, tables are ready for the omics_preparation.

ACCESS TO PRE-HARMONIZED CPTAC DATA

access_harmonized_CPTAC_data

If you want to analyze pre-harmonized data in the next steps, you can start with this function of PatientProfiler.

For this package we created an RDS file containing pre-harmonized data from CPTAC database.

The file includes transcritpomics, proteomics and phosphoproteomics data for 10 different tumors:

  • Brca - Breast cancer

  • Ccrcc - Clear cell renal cell carcinoma

  • Coad - Colon adenocarcinoma

  • Gbm - Glioblastoma

  • Hnscc -Head and neck squamous cell cacinoma

  • Lscc - Lung squamous cell carcinoma

  • Luad - Lung cancer

  • Ov - Ovarian cancer

  • Pdac - Pancreatic ductal adenocarcinoma

  • Ucec - Uterine corpus endometrial carcinoma

All the data has been harmonized using [omics_update] function.

Example of usage:

If you need to extract Brca and Ov phosphoproteomics and proteomics data:

access_harmonized_CPTAC_data(tumors = c("Brca", "Ov"), omics = c("phospho", "prot"))

The output is a local variable for each data frame you acceded.

INPUT TABLES TO INFER PROTEIN ACTIVITY AND NETWORK CONSTRUCTION

The next steps of PatientProfiler include wrapper functions to our in-house developed SignalingProfiler pipeline. As such, the harmonized data obtained with omics_update or access_harmonized_CPTAC_data should be transformed in SignalingProfiler-compliant format. To this aim, in PatientProfiler, we developed omics_preparation function.

omics_preparation

For each uploaded omic type, a folder with a user-defined name will be created. Each folder will contain a file (data frame) for each patient, representing the needed omic input needed in SignalingProfiler-compliant format for the next steps.

Example of usage:


tr_updated <- read_tsv("./vignette/PatientProfiler_processed_input/Transcriptomics_updated.tsv")
pr_updated <- read_tsv("./vignette/PatientProfiler_processed_input/Proteomics_updated.tsv")
ph_updated <- read_tsv("./vignette/PatientProfiler_processed_input/Phosphoproteomics_updated.tsv")

omics_preparation(
              df_tr_updated = tr_updated,           
              df_pr_updated = pr_updated, 
              df_ph_updated = ph_updated,             
              transc_dir_name = "Transc_patients",   
              prot_dir_name = "Prot_patients",                  
              phospho_dir_name = "Phospho_patients"
              )

Description To perform the following steps, the SignalingProfiler tool is used: you can check out its tutorial here and the publication regarding it here.

STEP 2) PROTEIN ACTIVITY INFERENCE

Infer signaling proteins activity at single patient level

extract_protein_activity

This function is a wrapper to SignalingProfiler first step functions, wherein protein activity modulation is inferred from multi-omics data combining footprint-based techniques with PhosphoScore algorithm.

To extract protein activity for a single patient, you need to upload transcriptomics, proteomics, or phosphoproteomics data (or all of them) and set the parameters as described in the information part of the function (see the description by typing “?extract_protein_activity” in your console).

The parameters are managed as a list of default parameters that you can change as you wish, either inside or outside the function call.

(Note: The prot_df parameter in tf_params and kin_params will only be used if a prot_file is specified, since the function automatically loads proteomics data only if correct_proteomics is TRUE and if it finds the specified file)

Example of usage:

tf_params <- list(reg_minsize = 5, collectri = TRUE) 
kin_params <- list(reg_minsize = 5)

patient_id <- 'CPT000814'

extract_protein_activity(
                prot_file = paste0("Prot_patients/Prot_Patient_", patient_id, ".tsv"),       
                trans_file = paste0("Transc_patients/Transc_Patient_", patient_id, ".tsv"),   
                phospho_file = paste0("Phospho_patients/Phospho_Patient_", patient_id, ".tsv"), 
                tf_params,
                kin_params,
                output_dir = "Activities"
                )

Results output

For each patient an .xlsx file (Activity_patient_{patient_id}.tsv) will be saved inside the output_dir directory (default, ‘Activities’), with these columns:

  • UNIPROT: protein Uniprot ID.

  • gene_name: name of the protein.

  • mf: molecular functions.

  • final_score: activity score.

  • method: how final_score is calculated.

Infer signaling proteins activity at cohort level

extract_cohort_activity

To perform the protein activity inference step for all the patients of a cohort, this function loops on each omic file for each patient and infer protein activity using [extract_protein_activity].

(Note: The prot_df parameter in tf_params and kin_params will only be used if a prot_dir is specified, since the function automatically loads proteomics data only if correct_proteomics is TRUE and if it finds files in the specified folder)

tf_params <- list(reg_minsize = 5, collectri = TRUE)
kin_params <- list(exp_sign = TRUE)

extract_cohort_activity(
                        prot_dir = "Prot_patients/", 
                        trans_dir = "Transc_patients/",
                        phospho_dir = "Phospho_patients/", 
                        tf_params, 
                        kin_params,
                        output_dir = "Activities"
                        )

Results output

The cohort final results will be saved as a set of .tsv files (1 file - 1 patient=) inside the output_dir directory (default, ‘Activities’). Each patient-specific file (Activity_patient_{patient_id}.tsv) contains these columns:

  • UNIPROT: protein Uniprot ID.

  • gene_name: name of the protein.

  • mf: molecular functions.

  • final_score: activity score.

  • method: how final_score is calculated.

STEP 3) NETWORK GENERATION

Create a signaling network for a single patient

In this step, PatientProfiler creates for each patient a causal network connecting mutated genes to a user-defined set of cellular functional traits (e.g., APOPTOSIS, PROLIFERATION).

Importantly, this step requires a mutations file where rows are patients and columns are genes, and each cell is the impact on protein function of the mutation.

To this aim, we implemented a wrapper to SignalingProfiler network creation step

create_network

This function sequentially calls five functions to:

  1. Select a PKN derived from SIGNOR and PhosphoSitePlus (get_PKN);

  2. Create a naive-network connecting mutations to inferred proteins (create_naive_network);

  3. Optimize the naive network over protein activity (optimize_network_with_carnival);

  4. Infer the activity of phenotypes from model proteins and connect them in a proteins-to-phenotypes network (infer_and_link_phenotypes);

  5. Manipulate the proteins-to-phenotypes model for visualization and functional circuits creation (format_patient_networks).

Each step inside network creation requires a set of parameters, extensively described in the functions documentation (type ?initialize_net_default_params” in your console. The following function is called internally and automatically set up the default parameters, as identified in SignalingProfiler benchmarking. However, the user can customize the parameters for each step modifying PKN_options, naive_options, carnival_options, phenoscore_options, format_options parameters list.

output_dir <- 'Networks_output'
network_params <- initialize_net_default_params(output_dir)

Example of usage:


patient_id <- 'CPT000814'

# Read mutation file (starting nodes of the model)
mutations_df <- read_tsv('./PatientProfiler_raw_input/Brca_mutations.tsv', show_col_types = F)
sources <- mutations_df[mutations_df$Patient_ID == patient_id, ]

# Read activity file (intermediate nodes of the model)
activities <- read_tsv(paste0('./Activities/Activity_Patient_', patient_id, '.tsv'), show_col_types = F)

# Set the functional traits vector (final nodes of the model)
desired_phenotypes <- c('APOPTOSIS', 'PROLIFERATION')

# Read omics file
proteomics <- read_tsv(paste0('./Prot_patients/Prot_Patient_', patient_id, '.tsv'), , show_col_types = F)
transcriptomics <- read_tsv(paste0('./Transc_patients/Transc_Patient_', patient_id, '.tsv'), show_col_types = F)
phosphoproteomics <- read_tsv(paste0('./Phospho_patients/Phospho_Patient_', patient_id, '.tsv'),show_col_types = F)

# ProxPath proteins-to-phenotypes pre-processing
pheno_distances_table <- proxpath_preprocessing(proteomics = proteomics, 
                                                phosphoproteomics = phosphoproteomics)

create_network(patient_id = patient_id,
              sources = sources,
              activities = activities,
              transcriptomics = transcriptomics,
              proteomics = proteomics,
              phosphoproteomics = phosphoproteomics,
              desired_phenotypes = desired_phenotypes,
              pheno_distances_table = pheno_distances_table,
              output_dir = 'Networks_output',
              save_all_files = FALSE,
              PKN_options = list(direct = FALSE),
              naive_options = list(layers = 2, max_length = c(1,4)),
              carnival_options = list(carnival_type = 'vanilla_one_shot'),
              phenoscore_options = list(), # keep default params
              format_options = list(optimize_on_phenotypes = FALSE,
                                    circuits_params = list(k = -1),
                                    vis_cytoscape = FALSE))

Create signaling networks at cohort level

create_cohort_network

This function loops on the folder of omics and activity data of a cohort of patients creating a network of each of the patients from mutations to desired phenotypes.

For the files generated for each patient, refer to create_network function documentation.

Example of usage:

create_cohort_networks(trans_dir = './Transc_patients/',
                      prot_dir = './Prot_patients/',
                      phospho_dir = './Phospho_patients/',
                      act_dir = './Activities/',
                      mut_file = './PatientProfiler_raw_input/Brca_mutations.tsv',
                      output_dir = './Networks_output/',
                      desired_phenotypes = c('APOPTOSIS', 'PROLIFERATION'),
                      pheno_distances_table = TRUE,
                      save_all_files = FALSE,
                      PKN_options = list(direct = FALSE),
                      naive_options = list(layers = 2, max_length = c(1,4)),
                      carnival_options = list(solver = 'cplex', carnival_type = 
                                                'vanilla_one_shot'),
                      format_options = list(optimize_on_phenotypes = FALSE,
                                    circuits_params = list(k = -1),
                                    vis_cytoscape = FALSE)
                      )

STEP 4) COMMUNITIES DETECTION

Generate communities from patient-specific mechanistic models

generate_communities

In this step, PatientProfiler extracts communities from the patient-specific mechanistic models generated in Step 3.

Importantly, this step requires a patient stratification file (patients_stratification.tsv), where each row represents a patient and the column indicates the subtype. ou need to place this file in a designated directory (dir_path), which will be deleted after this step.

Example of usage:


generate_communities(dir_path = "vignette/Communities/input_communities", 
                     network_dir = "vignette/Networks_output",
                     output_dir = "vignette/Communities/output_communities",
                     local = TRUE)

Results output

The output_dir directory contains a subdirectory for each detected community. Each community directory includes:

A file for edges. A file for nodes. A file listing the patients belonging to the community.

STEP 5) SIGNATURES IDENTIFICATIONS

Extract signaling-driven transcriptomics signature from communities

extract_signatures

Thanks to this function, we can extract signaling-driven transcriptomic signatures. To do this, you need the transcriptomics dataset obtained after Step 1 and the communities detected in Step 4.

Example of usage:


extract_signatures <- function(base_path = "./vignette/Communities/output_communities/",          
                               transcriptomics_file = "./vignette/PatientProfiler_processed_input/Transcriptomics_updated.tsv", 
                               output_dir = "./vignette",
                               padj_thres = 0.01,
                               diff_thres = 0.5,
                               mean_exp_clus_thres = 0,
                               max_val = 50)
  
  

Results output

The Signatures directory containing a dataframe for each signature with the genes list.